-
The paper introduces the first formulation of convex Q-learning for Markov decision processes with function approximation. The algorithms and theory rest on a relaxation of a dual of Manne's celebrated linear programming characterization of optimal control. The first set of contributions concerns properties of the relaxation, described as a deterministic convex program: we identify conditions for a bounded solution, and a significant connection between the solution to the new convex program and the solution to standard Q-learning with linear function approximation. The second set of contributions concerns algorithm design and analysis: (i) a direct model-free method for approximating the convex program for Q-learning shares properties with its ideal; in particular, a bounded solution is ensured subject to a simple property of the basis functions; (ii) the proposed algorithms are convergent, and new techniques are introduced to obtain the rate of convergence in a mean-square sense; (iii) the approach can be generalized to a range of performance criteria, and it is found that variance can be reduced by considering "relative" dynamic programming equations; (iv) the theory is illustrated with an application to a classical inventory control problem.
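As a concrete illustration, the following is a minimal sketch (not the paper's implementation) of a sampled convex program for Q-learning with a linearly parameterized Q-function: a linear objective is maximized subject to relaxed Bellman inequality constraints, mirroring the LP structure referenced above. The basis psi, the discount factor, and the synthetic transition data are all illustrative assumptions.

```python
# Minimal sketch of a sampled convex program for Q-learning with linear
# function approximation. The basis, discount factor, and data are assumed
# for illustration; whether the program is bounded depends on the basis and
# the exploration in the data, which is exactly what the paper analyzes.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
gamma = 0.95
actions = [0, 1]                      # assumed finite action set
d = 4                                 # number of basis functions

def psi(x, u):
    """Illustrative basis: simple polynomial features of (x, u)."""
    return np.array([1.0, x, x * u, x ** 2])

# Synthetic transition data standing in for observations of the system.
N = 200
X = rng.uniform(-1, 1, N)
U = rng.integers(0, 2, N)
Xnext = 0.9 * X + 0.1 * U + 0.05 * rng.standard_normal(N)
C = X ** 2 + 0.1 * U                  # per-step cost c(x, u)

theta = cp.Variable(d)
Q = lambda x, u: psi(x, u) @ theta    # Q^theta(x, u), linear in theta

constraints = []
for k in range(N):
    # Relaxed Bellman inequality: Q(x,u) <= c(x,u) + gamma * min_u' Q(x',u')
    q_next = cp.minimum(*[Q(Xnext[k], a) for a in actions])
    constraints.append(Q(X[k], U[k]) <= C[k] + gamma * q_next)

# Maximize the average Q value along the data, in the spirit of the
# "maximize subject to the Bellman inequality" structure of Manne's LP.
objective = cp.Maximize(sum(Q(X[k], U[k]) for k in range(N)) / N)
prob = cp.Problem(objective, constraints)
prob.solve()
print("theta* =", theta.value)
```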
-
We have all heard that there is a growing need to secure resources to maintain supply-demand balance in a power grid facing increasing volatility from renewable sources of energy. There are mandates for utility-scale battery systems in regions all over the world, and there is a growing science of “demand dispatch” to obtain virtual energy storage from flexible electric loads such as water heaters, air conditioning, and pumps for irrigation. The question addressed in this tutorial is how to manage a large number of assets for balancing the grid. The focus is on variants of the economic dispatch problem, which may be regarded as the “feed-forward” component in an overall control architecture. 1) The resource allocation problem is identical to a finite-horizon optimal control problem with degenerate cost, so-called “cheap control”. This implies a form of state space collapse, whose form is identified: the marginal cost for each load class evolves in a two-dimensional subspace, spanned by a scalar co-state process and its derivative. 2) The implication for distributed control is remarkable. Once the co-state process is synthesized, this common signal may be broadcast to each asset for optimal control. However, the optimal solution is extremely fragile, in a sense made clear through results from numerical studies. 3) Several remedies are proposed to address fragility. One is described through “robust training” in a particular Q-learning architecture (one approach to reinforcement learning). In numerical studies it is found that specialized training leads to more robust control solutions.
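To make the broadcast idea concrete, here is a minimal sketch (my illustration, not from the tutorial) of a coordinator computing a single price/co-state signal and broadcasting it, with each load class choosing its allocation locally. Dynamics and the two-dimensional state-space collapse described above are omitted; the cost weights and reference signal are assumptions.

```python
# Static per-hour decomposition of an economic-dispatch-style allocation:
# minimize sum_i 0.5*q_i*u_i^2 subject to sum_i u_i = r(t) at each t.
# Stationarity gives u_i = lambda/q_i, so lambda(t) = r(t) / sum_i(1/q_i):
# one scalar signal is broadcast, and every asset responds locally.
import numpy as np

T = 24                                                # hours in the horizon
r = 50 + 20 * np.sin(np.linspace(0, 2 * np.pi, T))   # balancing reference (MW), illustrative
q = np.array([1.0, 2.0, 4.0])                         # per-class quadratic cost weights (assumed)

lam = r / np.sum(1.0 / q)            # common co-state signal, broadcast to all assets
u = lam[None, :] / q[:, None]        # each class's local response, shape (num_classes, T)

assert np.allclose(u.sum(axis=0), r)  # supply-demand balance holds at every hour
print("peak per-class allocations:", u.max(axis=1))
```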
-
Convex Q-learning is a recent approach to reinforcement learning, motivated by the possibility of a firmer theory for convergence and the possibility of making use of greater a priori knowledge regarding policy or value function structure. This paper explores algorithm design in the continuous-time domain, with a finite-horizon optimal control objective. The main contributions are: (i) the new Q-ODE, a model-free characterization of the Hamilton-Jacobi-Bellman equation; (ii) a formulation of convex Q-learning that avoids approximations appearing in prior work, in which the Bellman error used in the algorithm is defined by filtered measurements, as is necessary in the presence of measurement noise; (iii) a proof that convex Q-learning with linear function approximation is a convex program, whose constraint region is bounded subject to an exploration condition on the training input; (iv) an illustration of the theory in application to resource allocation for distributed energy resources, for which the theory is ideally suited.
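For context on the finite-horizon HJB equation that the Q-ODE characterizes model-free, here is a minimal model-based reference point (my sketch, not the paper's algorithm): for a linear-quadratic problem the HJB reduces to a Riccati differential equation solved backward in time. The system matrices, costs, and horizon below are illustrative.

```python
# Finite-horizon LQ baseline: the HJB solution is J(x, t) = x' P(t) x with
# P governed by the Riccati ODE, integrated backward from the terminal cost.
import numpy as np
from scipy.integrate import solve_ivp

A = np.array([[0.0, 1.0], [0.0, -0.5]])   # assumed linear dynamics dx/dt = Ax + Bu
B = np.array([[0.0], [1.0]])
Qc = np.eye(2)                            # state cost weight
R = np.array([[0.1]])                     # control cost weight
PT = np.eye(2)                            # terminal cost weight
T = 5.0                                   # horizon length

def riccati_rhs(t, p_flat):
    # dP/dt = -(A'P + PA - P B R^{-1} B' P + Q), integrated from t = T down to 0
    P = p_flat.reshape(2, 2)
    dP = -(A.T @ P + P @ A - P @ B @ np.linalg.solve(R, B.T) @ P + Qc)
    return dP.ravel()

sol = solve_ivp(riccati_rhs, (T, 0.0), PT.ravel())
P0 = sol.y[:, -1].reshape(2, 2)           # value function at t = 0: J(x, 0) = x' P0 x
K0 = np.linalg.solve(R, B.T @ P0)         # optimal feedback gain u = -K0 x at t = 0
print("P(0) =\n", P0, "\nK(0) =", K0)
```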
-
We propose a deep reinforcement learning (DRL) methodology for the tracking, obstacle avoidance, and formation control of nonholonomic robots. By separating vision-based control into a perception module and a controller module, we can train a DRL agent without sophisticated physics or 3D modeling. In addition, the modular framework avoids costly retraining of an image-to-action end-to-end neural network and provides flexibility in transferring the controller to different robots. First, we train a convolutional neural network (CNN) to accurately localize in an indoor setting with dynamic foreground/background. Then, we design a new DRL algorithm named Momentum Policy Gradient (MPG) for continuous control tasks and prove its convergence. We also show that MPG is robust at tracking varying leader movements and can naturally be extended to problems of formation control. Leveraging reward shaping, features such as collision and obstacle avoidance can easily be integrated into the DRL controller.
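The following is a minimal sketch (illustrative, not the authors' code) of the modular separation described above: a CNN perception module estimates the robot's pose from images, and a separate controller maps the estimated pose and a goal to the (v, omega) command of a nonholonomic robot. Layer sizes and interfaces are assumptions, and this does not implement the MPG update itself.

```python
# Modular perception/controller split: the two networks are trained
# separately (perception on labeled images, the controller with DRL) and
# composed only at deployment, so the controller can be transferred to a
# different robot by retraining or replacing perception alone.
import torch
import torch.nn as nn

class PerceptionCNN(nn.Module):
    """Image -> estimated pose (x, y, theta)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, 3)

    def forward(self, img):
        return self.head(self.features(img))

class Controller(nn.Module):
    """(estimated pose, goal) -> continuous command (v, omega)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(6, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 2), nn.Tanh(),   # bounded commands
        )

    def forward(self, pose, goal):
        return self.net(torch.cat([pose, goal], dim=-1))

perception, controller = PerceptionCNN(), Controller()
img = torch.randn(1, 3, 64, 64)                  # placeholder camera frame
goal = torch.tensor([[1.0, 0.5, 0.0]])           # illustrative goal pose
command = controller(perception(img), goal)      # (v, omega) in [-1, 1]
print(command)
```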
